mgfuncfandomcom-20200216-history
Documentation
=Documentation= This section contains the documentation for MGFunc, with a tutorial on how to use it, and a step-by-step guide that explains each section in the config-file. Module 1 - Preparation Section 0: Beginning This section tells MGFunc where to save results and in which directories it can find the needed programs and files. # Specify the working directory where the results will be saved workdir = ./MGFuncrun # Path in your system for MGFunc scripts.pkgdir = /home/usr/bin/MGFunc # Path in your system to usearch executable (do not add the program name) If not given, default will be current directoryusearchpath = /home/usr/bin # Gene ontology .obo file with full pathgene_ontology_obo = /home/projectdir/gene_ontology.1_2.obo # If there has been a previous clustering done on the gene catalog, specify the precluster filepreclusterfile = /home/projectdir/genecatalog.preclust Section 1: FormatSwiss MGFunc works with two formats of UniProt database file that is downloaded in text format (.dat) and given as a main option to the MGFunc.py. The Uniprot DAT file (ex. uniprot.dat) can be converted to these formats (i.e. uniprot.tab and uniprot.fasta) in the formatswiss section. If you need the conversions to be made, -makefasta and -maketab options should be "True". If you have the converted files already, they should be in the same folder with the .dat file and the two options can be kept "False". It the options are "False" but the TAB and FASTA files are not found, the program will raise an error and exit. Explanations for options in formatswiss section are as follows: formatswiss -i: Specify the EMBL-formatted UniProt database file with the absolute path in the system. Examples: -i = DEFAULT -i = /home/projectdir/Uniprot/uniprot_all.dat -o: Basename for the output file. The extension will be automatically .fasta or .tab Examples: -o = DEFAULT -o = genecatalog -v: True/False -makefasta: True/False.If you want to convert the .dat file to fasta, write True. -maketab: True/False.If you want to convert the .dat file to tab, write True. This section can be ran separately using the "swiss2tab.py" and "swiss2fasta.py" scripts separetely. See help files by using -h option for these scripts or here: swiss2tab.py -h , swiss2fasta.py -h Module 2 - Homology Search Section 2. Sequence Alignment: UBLAST Uniprot In this section the gene sequences are searched in the Uniprot Protein database. It the “-run” option is kept “True” the pipeline will attempt a “usearch ublast” run on the system as a separate job or program than the python program. USEARCH UBLAST is a very fast alignment tool that can be run using a multithreading option “-threads” (USEARCH reference and user guide to ublast). If the “-threads” option for the Usearch program itself is not specified, the program uses the available maximum system resources to run, which is not ideal in all cases. Therefore the option “-t” in the configuration file helps the used select the maximum processors to use for the usearch run. The default value is 3 processors, however if the system requirements does not meet 3 CPUs, ublast will adjust itself. The pipeline incorporates this program through the shell commands that is inside MGFunc.py, however if the user would like to run the program separately, the small shell script “runusearch.sh” can be found in the package. As in a normal BLAST run, ublast also runs with an indexed database format that in this case that has to be created before the Ublast run. This is again another program in the USEARCH package, which is normally run by “usearch -makeudb_ublast” command. This process is also incorporated in MGFunc.py through shellscripting and can be controlled by the “makeudb” option in the configuration file. If the .udb file exists, the file name can be given with the absolute path in the option “uniprotudb”. However a separate small shell script can be found in the package with the name “makeublast.sh”. The running time of this section may be long depending on the database size and may result generates a large file (fx, Release 2013-10 for UniProtKB was a file with ~70GB size). But this step is presumably ran only once for a period, until another release of Uniprot is used. If the “makeudb” is set to “False” and the “uniprotudb” is set to DEFAULT, there will be an error message. The output of the usearch is set by the usearch output option “-blast6out” as a default in the pipeline and not subject to change through config file. ublast_uniprot ubevalue : E-value cut off option for the UBLAST program makeudb : True/False -t : Integer specifying the number of threads. uniprotudb : Specify UDB formatted database file for the UBLAST run. Example: uniprotudb = /home/projectdir/Uniprot/uniprot_all.udb Section 3. Sequence Alignment: UBLAST Genecatalog The UBLAST runs as mentioned above, only using the gene catalog as both input and database file. The ublast database is generated again as mentioned above however the option names are changed to “makegdb” for running the usearch command for database generation and “genecatudb” for providing the database file. Option “-t” again corresponds to the “-threads” option for the usearch program. ublast_genecatalog ubevalue : E-value cut off option for the UBLAST program makegdb : True/False -t : Integer specifying the number of threads. genecatudb : Specify UDB formatted genecatalog file as a database for the UBLAST run. Example: genecatudb = /home/projectdir/Genecat/genecatalog.udb Section 4 and 5. Selection of Alignments: BLASTdecide Uniprot and BLASTdecide Genecatalog BLAST outputs from Section 2 and 3 is parsed in the Section 4 and 5 respectively for the significant hits given the selected cut-offs. In this section, "bldecide.py" script is used and the parameters for this script can be set in the configuration file. This script contains more options when run standalone, however for the pipeline “-it” option is needed for the input, that a BLAST output file in tabular format, which is also the outcome of the ublast search (-blast6out). For BLAST parsing, a cut-off can be given to the the sequence similarity, coverage and e-value of the alignments when selecting the significant hits. Sequence similarity is calculated as the percentage fraction of similar and matching amino acids between the query and database sequence over the query length. Similar amino acids are the non-matching amino-acids in the alignment that share common physicochemical properties and result in a positive score in the subsequent scoring matrix, hence they are called positive substitutions. Coverage corresponds to the query coverage, which is the fraction of the alignment length over the query length in percentages. For this calculation the length of the query sequences are required and it is not provided in the BLAST output. The length files for each gene catalog file is calculated in the beginning of the pipeline and these files are used as a default “-l” value for the "bldecide.py". Since the pipeline generates these files on the initial steps, the related sections do not contain a “-l” option in the configuration file. Similarity and coverage are calculated while parsing the alignment results from the BLAST output within the "bldecide.py". The E-value or the expectation value on the other hand is calculated by the BLAST program and is a result of a statistical test that analyzes the probability of an alignment score being a result of chance. It represents the significance of an alignment score; as the e-value is gets lower, the likelihood of getting an alignment with the same score by chance also gets lower. E-values lower than 1e-5 have been used in many studies as a standard of finding significant hits. The default values in the pipeline are 50% for similarity, 50% for coverage and an E-value of 1e-10 which can be adjusted in the configuration file under both Section 4 and 5 with “-s”, “-q” and “-e” options respectively. blastdecide_uniprot or blastdecide_genecatalog -it : The result of UBLAST run (tab separated UBLAST output). If there are multiple file, the input will be a tab separated file, with the UBLAST results file in one column and the sequence lengths file on the second column. For other options in this section, see help files by using -h option for "bldecide.py" or here: bldecide.py -h Module 3 - Clustering Section 6. Clustergenes All genes in the gene catalog are subjected to a homology clustering based on reciprocal best-hits using the "clustergenes.py" script. The results of the blast parse is used as an input. A gene may have multiple significant hits, and some may be very similar to each other. The reciprocal best-hits are used to generate a Graph which then is used for clustering. The reciprocal hits for each protein has been identified (using 50/50 blast decide criteria by default) and a graph with nodes as the proteins and the edges as the reciprocal hits connection has been generated from a number of best hits for each protein (-b option in config). From this graph, connected components have been identified, corresponding to the clusters within the data, which can be interpreted as gene families. Each cluster is given a unique ID and the members of the clusters are reported. clustergenes -p : Blast parser output from Section 5. Keep as default if previous sections are ran with MGFunc. Otherwise write the file name. Example: -p = ./MGFuncrun/blastdecide/genecatalog.ALL.genesvsgenes.blastdecide -b : Number of best hits to be used in the reciprocal hits graph -m : Method type, reciprocal or trihits -a : Instead of best hits option, use all hits to make the reciprocal hits graph -o : Output name. Keep DEFAULT if you are running the pipeline. See help file for "clustergenes.py" by -h option or here : clustergenes.py -h Section 7. Clusterfilter Using the "clusterFilter.py" script the outcome of the clusters from the previous section can be selected by size (“-m”), number of unique samples (“-n”) and sample names (“-k”). clusterFilter -i : Cluster input file. Keep DEFAULT if you are running it in the pipeline. -n : Cut-off for unique number of samples in the cluster. Write an integer. fx. 3 -k : Name of the samples in the cluster to be kept and rest is filtered out. -p : Logical operator. When chosing option -k specify if samples should have an AND (&,AND,A) or OR (|,OR,O) relationship. -m : Cut-off for number of genes in the cluster. -o : Output basename. Keep to DEFAULT if you are not using the script separately. See help file for "clusterFilter.py" by -h option or here : clusterFilter.py -h Module 4 - Cluster Annotation Section 8. Parseclusterinfo Annotations for each cluster is based on the Uniprot hits of the genes in the cluster. In this section the uniprot.tab file is used to fetch the information related to the significant UniProt hits of the genes. The script that performs this section is called "parseclinfo.py". This script has many input options. The cluster file (“-c”), parsed results of ublast against UniProt (“-b”) and the Uniprot database files (“-ut”, “-ui”, “-uif”) and a list of all gene ids within the gene catalog (“-g”) are used. The script uses multithreading that is implemented with Python’s threading and Queue packages. This way each cluster information can be gathered in parallel. The script outputs many different files. One major output file is the cluster file that has the annotated UniProt IDs in a tab separated format as well as the GO, EC and KEGG IDs that are linked to the UniProt IDs. In this output, clusters are also flagged as “A”, “S” and “N” as explained in Table XX. *significant hits are decided by the cut-offs in Section 4. Metagenomics gene catalogs are generally undergone a homology-reduction step using a fast clustering method such as CH-HIT or UCLUST, which are very accurate in clustering very similar protein sequences. The output of these programs are generally a fasta file with representative sequences of clusters and another file with the specific information about clusters, members and alignment CIGARs..etc. The latter one is converted into a tab separated file with unique cluster ids, gene ids in the cluster and the representative gene ID. This reduction will be referred as pre-clusters and the clusters generated using reciprocal hits in the MGFunc will be referred as RBH-clusters. Gene catalogs that is used in this pipeline may infact be already undergone the homology-reduction step. Therefore, by using this information the genes within the RH-clusters can be expanded using their pre-cluster sizes and gene IDs. Eventually the pre-cluster IDs inherits the same annotations with the gene in the RH-cluster. parseclusterinfo -c : clusterfile, -b: genecat_vs_uniprot.blastdecide -bh : only best hit UniProt results. If False, all uniprot IDs will be displayed -i : Blastdecide indexing True or False. If False, index file will be assumed as .blastdecide.indexed in the same directory as blastdecide file -ut :Tab separated UniProt format. -ui :Uniprot tab file indexed format. -uif:Uniprot folder for the index files. -e : Expansion, if chosen, a file name with the preclusters are reqiured. -c : Gene catalog clusters file, output of Section 7. Keep DEFAULT if file name is not known. If you want to start running the pipeline from this section, specify the cluster file name. -bf: Blast decide folder. Only use if there are multiple blastdecide inputs that are all in this folder. Otherwise remove the option from the list. Example: -bf = home/projectdir/MGFuncrun/blastdecide -b : Name of the Blast decide file. If unknown keep DEFAULT. If there are multiple files, keep the -bf option as the folder name and -b option for file names and separate the file names in a comma(","). -g : Keep to DEFAULT. This file is the list of all geneids in the gene catalog. -t : Threading options, each cluster and gens that are not in clusters are utilized per thread. -na: True/False. Keep the genes with No Annotations in the analysis for the next step. -o : Output base name. Keep to DEAULT. See help file for "parseclinfo.py" by -h option or here : parseclinfo.py -h Section 9. Enrichment The functional agreement in clusters is an important issue, as it is the step where the clusters are being confirmed that the members of the cluster corresponds to functional orthologs. In the MGFunc pipeline, for the assessment of the clusters functional coherence, a GO Term enrichment step is implemented. This step makes use of a Python package called “goatools” that is described as “Tools for Gene Ontology”(https://github.com/tanghaibao/goatools). Goatools is an open source package that contains tools to make GO Term enrichment analysis using Fisher’s exact test and multiple test correction methods such as Bonferonni, Holms, Sidak and FDR is implemented. These were implemented in a script called “find_enrichment.py” in goatools and this script is wrapped in the MGFunc pipeline. The original version found in GitHub makes use of a two-tailed Fisher’s exact test via “fisher” package in python. However in MGFunc the uncorrected test statistic was converted to a one-tail Fisher’s exact test, using the right-tailed p-value as the output. The package also includes tools for reading and plotting GO lineages as well as mapping the terms to GOSlim terms however these have not been used in MGFunc. enrichment -i : Gene catalog cluster file that also contains annotations from the database. Keep DEFAULT if file name is not known. -a : Association file. A file with gene names and their GO terms in a tab separated format. -p : Population file. If you have an association file, population is generally the first column -s : Relative location or abs. path for the find_enrichment.py -o : Name of output file -t : Thread limit option. When worked with multiple threads, each thread works with one cluster's enrichment analysis. For a separate run of this step, "goawrapper.py" script can be used. See help file for "goawrapper.py" here: goawrapper.py -h Module 5 - Sequence Extraction Section 10. Cluster2Fasta Once the clusters are annotated with Uniprot IDs, the sequences that belong to each cluster are extracted from the gene catalog and UniProt FASTA files, and saved in a separate file, while the Cluster IDs of sequences are kept in the header along with gene names for the catalog and UniprotIDs for the database. For this process "cluster2fasta.py" is used. The input for this class is a cluster file, gene catalog and uniprot database files. cluster2fasta -c : cluster file name that is generated at the clusterFilter step. -ui : Indexed Uniprot FASTA file. ( Leave DEFAULT if unknown) Example: -ui=/home/projectdir/Uniprot/uniprot_all.dat.fasta.indexed -uf : Uniprot FASTA file ( Leave DEFAULT if unknown) Example: -f=/home/projectdir/Uniprot/uniprot_all.dat.fasta -ki : Indexed Gene catalog FASTA file ( Leave DEFAULT if unknown) -kf : Gene catalog FASTA file (Leave DEFAULT) -sfi: List of gene catalog index file and fasta file if there are multiple inputs. -o : Output name. (Leave DEFAULT if not decided) -t : Thread limit option. Maximum 3 threads can be selected. See help file for "cluster2fasta.py" here: cluster2fasta.py -h Section 11. Split Clusters In this section, sequences that belong to each cluster are extracted into separate FASTA files, while giving the cluster ID as a filename. The results are saved in the "splitclusters" folder. splitclusters # -i Input is fasta file with different clusters in it. # input can be given in a list, separated by comma, no spaces # -cl is the file that contains a list of cluster IDs, one ID per line # If multiple input files are given, clid files should be in the same order and amount # Output folder names can be given in list -i : Input is FASTA file containing all sequences from each cluster. Multiple inputs can be given, fx one sample per file. Then the input names should be separated by comma, no spaces. -cl: Input cluster file form a precious step. -o : Output name. Leave DEFAULT if not decided. -t : Thread limit option. Maximum 3 threads can be selected. See help for "splitclusters.py" by using -h or here: splitclusters.py -h Section 12. Change Headers The header files for a FASTA sequence can be changed here. In this change a unique short ID is given to each sequence in order to be able to run the subsequent alignment and tree sections properly. At this step, the clusters are already splitted, so there is one FASTA file per cluster. Therefore the input to change headers is the directory name for the FASTA files. changeheaders -i : input folder name that contains the FASTA files to be changed -n : output name for the new FASTA file -ht: header conversion table. New and old headers are listed in a tab sep. format. -T : Thread limit. Each thread runs for one cluster FASTA file. See help for "changeheaders.py" by using -h or here: changeheaders.py -h Module 6 - Phylogenetic Trees The final step of MGFunc pipeline is to create the phylogenetic trees for each cluster. Sequences that are previously separated for each cluster are first aligned using a multiple alignment tool, MUSCLE and then a phylogenetic tree is constructed using PAUP. Section 13. Run Alignment In this section, multiple alignments for each cluster FASTA file is performed with the -maxiter options of MUSCLE and the output is chosen as -clustalout. To run this section separately, you can use the "runalignments.sh" script provided with the package. runalignment -i : input folder name -fast : True/False. Fast alignment option for muscle is " --maxiter 3" normal/slower alignment option for muscle is " --maxiter 8" -T : Thread limit option. Each thread runs one alignment job with MUSCLE. Section 14. Trees and Treenames Since PAUP accepts only 30 characters in the ID of the sequence, all sequence headers were previously converted to a shorter version. These alignment files are first converted to a PIR format and then to a NEXUS format sequentially using the CLUSTALW program and the NEXUS file is used as an input to PAUP. The converted NEXUS file contains the sequence alignment as well as the commands for a neighbour joining tree in PAUP with a bootstrap test of 500 sampling. The file conversions are done internally in MGFunc, therefore, there is no external script to run the tree constructions. trees -T : Thread limit option. Each thread runs one tree construction with PAUP. The last step of tree making is the conversion of node names back to the original sequence IDs. This is handled by the treenames, where the user can also print out other IDs and functional names on the tree as well. Naming options are the UniProt IDs, UniProt descriptions, taxonomy or GO Term annotations on the tree leafs creating a better interpretation of the tree file. treenames -i : Header change tables as obtained in Section 12 -t : The tree files that are the outcome of the Paup run -U : Tab-formatted UniProt file -N : Node-names, can be 'D' (Description/function), 'G', (Gene name, DEFAULT), 'T', (taxonomy), or 'A', (All, tax,descript,genename). -o : Output basename -format : Output format for the final tree : newick or phyloxml -c : True/False. Clean option, delete all temporary files See help for "treenames.py" by using -h or here: treenames.py -h